Write sharded repodata #161
base: main
…hards for patching
@@ -91,6 +92,23 @@
repodata_version=2 which is supported in conda 24.5.0 or later.
""",
)
@click.option(
We could replace this with an "add-only" or "no-remove" option that would keep packages in the index even if they are not found in the filesystem.
These two options are more about testing from a backup of the conda-forge database. They may not survive into the main branch, or we could make them easier to use.
Now that the CEP is approved, this branch should be completed and become the way to run conda-index.
Combining the sharded CLI into the main CLI
show_default=True,
)
@click.option(
"--upstream-stage",
The stat table has multiple rows per package and stage. If a package only exists in stage=fs or clone, then its metadata is cached and a stage=index row is added. This option doesn't explain the mechanism.
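To make the mechanism concrete, here is a minimal sketch of a stage-keyed stat table. The schema is a simplified assumption (only `stage` and `path` columns), and `needs_extraction` and `mark_indexed` are hypothetical helper names, not functions from conda-index.

```python
import sqlite3

# Hypothetical, simplified schema: one row per (stage, path).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT NOT NULL, path TEXT NOT NULL, "
    "PRIMARY KEY (stage, path))"
)
conn.executemany(
    "INSERT INTO stat (stage, path) VALUES (?, ?)",
    [
        ("fs", "a-1.0-0.conda"),     # seen upstream and already indexed
        ("index", "a-1.0-0.conda"),
        ("fs", "b-2.0-0.conda"),     # seen upstream, metadata not cached yet
    ],
)


def needs_extraction(conn, upstream_stage="fs"):
    """Return paths present in upstream_stage that lack a stage='index' row."""
    rows = conn.execute(
        """
        SELECT up.path FROM stat AS up
        LEFT JOIN stat AS idx
          ON idx.stage = 'index' AND idx.path = up.path
        WHERE up.stage = ? AND idx.path IS NULL
        ORDER BY up.path
        """,
        (upstream_stage,),
    )
    return [path for (path,) in rows]


def mark_indexed(conn, paths):
    """After caching a package's metadata, record a stage='index' row for it."""
    conn.executemany(
        "INSERT OR REPLACE INTO stat (stage, path) VALUES ('index', ?)",
        [(path,) for path in paths],
    )


pending = needs_extraction(conn)
mark_indexed(conn, pending)
```

Passing `upstream_stage="clone"` would select the same kind of rows from a cloned listing instead of a fresh filesystem scan, which is roughly what the option's help text could describe.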
def _make_rss(channel_name, channeldata):
    return rss.get_rss(channel_name, channeldata)
Anticipating a "no conda dependency" or "optional current_repodata" feature, we move these to another module.
@@ -481,6 +339,9 @@ class ChannelIndex:
:param channel_url: fsspec URL where package files live. If provided, channel_root will only be used for cache and index output.
:param fs: ``MinimalFS`` instance to be used with channel_url. Wrap fsspec AbstractFileSystem with ``conda_index.index.fs.FsspecFS(fs)``.
:param base_url: Add ``base_url/<subdir>`` to repodata.json to be able to host packages separate from repodata.json
:param save_fs_state: Pass False to use cached filesystem state instead of ``os.listdir(subdir)``
:param write_monolithic: Pass True to write large 'repodata.json' with all packages.
Is repodata.json appropriately called the "monolithic" option?
@@ -557,7 +439,7 @@ def index(
# begin non-stop "extract packages into cache";
# extract_subdir_to_cache manages subprocesses. Keeps cores busy
# during write/patch/update channeldata steps.
def extract_subdirs_to_cache():
def extract_subdirs_to_cache(): # is the 'prepare' step in 'index_prepared_subdir'
index_prepared_subdir was renamed; we could say that we have to load metadata into the cache before we can generate repodata out of it.
# exactly these packages (unless they are un-indexable) will
# be in the output repodata
if self.save_fs_state:
    cache.save_fs_state(subdir_path)
I think we either use this flag or override the save_fs_state() function to be a no-op.
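The two alternatives can be compared in a small sketch. `Cache`, `NoListCache`, and `index_subdir` are hypothetical stand-ins written for illustration; only the `save_fs_state` name comes from the diff.

```python
class Cache:
    """Hypothetical stand-in for the per-subdir cache."""

    def __init__(self):
        self.listed = []

    def save_fs_state(self, subdir_path):
        # Pretend to record a fresh directory listing for this subdir.
        self.listed.append(subdir_path)


class NoListCache(Cache):
    """Alternative: override the method so callers need no flag."""

    def save_fs_state(self, subdir_path):
        # No-op: keep whatever filesystem state is already cached.
        pass


def index_subdir(cache, subdir_path, save_fs_state=True):
    """Flag-based variant: the caller decides whether to refresh the listing."""
    if save_fs_state:
        cache.save_fs_state(subdir_path)
    return cache.listed


with_flag = index_subdir(Cache(), "linux-64", save_fs_state=False)
with_override = index_subdir(NoListCache(), "linux-64")
default = index_subdir(Cache(), "linux-64")
```

Both approaches skip the listing; the flag keeps the choice visible at the call site, while the override moves it into the cache class.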
Description
What would it look like to generate sharded repodata per conda/ceps#75?
I'm interested in seeing whether we can generate shards efficiently, how repodata patching should work, and whether we could generate shards as the primary artifact and then derive repodata.json from the shards [at the same time, in the "processes a sequence of package names" code].
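As a rough illustration of "shards first, repodata.json derived", the sketch below groups records by package name and then merges the shards back into one mapping. The function names are hypothetical, and JSON plus SHA-256 stand in for whatever serialization and naming the CEP specifies.

```python
import hashlib
import json
from collections import defaultdict

GROUPS = ("packages", "packages.conda")


def build_shards(records):
    """Group {filename: record} into one shard per package name."""
    shards = defaultdict(lambda: {group: {} for group in GROUPS})
    for filename, record in sorted(records.items()):
        group = "packages.conda" if filename.endswith(".conda") else "packages"
        shards[record["name"]][group][filename] = record
    return dict(shards)


def shard_index(shards):
    """Map package name to a content hash of its serialized shard (assumption)."""
    return {
        name: hashlib.sha256(json.dumps(shard, sort_keys=True).encode()).hexdigest()
        for name, shard in shards.items()
    }


def monolithic_from_shards(shards):
    """Derive a repodata.json-like mapping by merging every shard."""
    repodata = {group: {} for group in GROUPS}
    for name in sorted(shards):
        for group in GROUPS:
            repodata[group].update(shards[name][group])
    return repodata


records = {
    "a-1.0-0.tar.bz2": {"name": "a", "version": "1.0"},
    "a-2.0-0.conda": {"name": "a", "version": "2.0"},
    "b-1.0-0.conda": {"name": "b", "version": "1.0"},
}
shards = build_shards(records)
index = shard_index(shards)
repodata = monolithic_from_shards(shards)
```

Because both outputs come from the same pass over package names, the monolithic file could be accumulated while each shard is written rather than rebuilt afterwards.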
How to test
Check out this repository and https://github.com/dholth/conda-test-data
Decompress conda_test_data/conda-forge/*/.cache/cache.db.zst
Download conda-forge-repodata-patches-<version>.conda
Run: python3 -m conda_index --write-shards --no-write-monolithic --upstream-stage=clone --no-update-cache --patch-generator ~/miniconda3/pkgs/conda-forge-repodata-patches-20240401.20.33.07-hd8ed1ab_1.conda --output /tmp/shards ~/prog/conda-test-data/conda-forge
Examine output in /tmp/shards
I've begun trying to apply the patches to individual shards. This is slow; we should compare it against applying the many patches to a whole repodata.json.
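One way to frame that comparison is to check that per-shard patching and whole-file patching agree before timing them. The sketch below assumes a simplified patch-instructions layout (per-record updates under `packages` and `packages.conda`, plus a `remove` list); `apply_patches` is a hypothetical helper, not the conda-index implementation.

```python
import copy

GROUPS = ("packages", "packages.conda")


def apply_patches(repodata, instructions):
    """Return a patched copy; handles record updates and 'remove' only."""
    patched = copy.deepcopy(repodata)
    removed = set(instructions.get("remove", ()))
    for group in GROUPS:
        records = patched.get(group, {})
        patches = instructions.get(group, {})
        # Iterate over the records we have and look each one up in the
        # patch mapping, so the cost scales with the shard's size.
        for filename in list(records):
            if filename in removed:
                del records[filename]
            elif filename in patches:
                records[filename].update(patches[filename])
    return patched


whole = {
    "packages": {"a-1-0.tar.bz2": {"name": "a", "depends": []}},
    "packages.conda": {
        "b-1-0.conda": {"name": "b", "depends": ["a"]},
        "b-2-0.conda": {"name": "b", "depends": ["a"]},
    },
}
shards = {
    "a": {"packages": whole["packages"], "packages.conda": {}},
    "b": {"packages": {}, "packages.conda": whole["packages.conda"]},
}
instructions = {
    "packages": {},
    "packages.conda": {"b-1-0.conda": {"depends": ["a >=1"]}},
    "remove": ["a-1-0.tar.bz2"],
}

patched_whole = apply_patches(whole, instructions)
patched_shards = {name: apply_patches(s, instructions) for name, s in shards.items()}

# Merge the patched shards back together for comparison.
merged = {group: {} for group in GROUPS}
for shard in patched_shards.values():
    for group in GROUPS:
        merged[group].update(shard[group])
```

With equivalence established, the two code paths can be timed on the conda-forge test data to see where the per-shard cost comes from.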
Checklist - did you ...
Add a file to the news directory (using the template) for the next release's release notes?